Tokenize whitespace #1570

Open · wants to merge 4 commits into base: master
Conversation

@JohnnyMorganz (Contributor)

We would like to develop tooling that can consume the syntax tree and perform transformations whilst preserving the underlying structure of the source code. This requires the lexer / parser to retain enough information to be able to re-print the original code.

The first step towards this is to ensure the lexer preserves the input structure. In this commit, we enable tokenization of whitespace, stored in a new Lexeme of type Whitespace. Multiline whitespace is all stored in a single Lexeme.
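
For illustration, here's a minimal standalone sketch of the scanning idea, not the actual Luau lexer internals (`scanWhitespaceRun` is a hypothetical helper):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Sketch only: scan the maximal run of whitespace (spaces, tabs, newlines)
// starting at `pos` and return the offset just past it, so the whole run
// can become a single Whitespace lexeme spanning [pos, end).
size_t scanWhitespaceRun(const std::string& source, size_t pos)
{
    while (pos < source.size() && std::isspace(static_cast<unsigned char>(source[pos])))
        ++pos;
    return pos;
}
```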


```diff
     if (updatePrevLocation)
         prevLocation = lexeme.location;

     lexeme = readNext();
     updatePrevLocation = false;
-} while (skipComments && (lexeme.type == Lexeme::Comment || lexeme.type == Lexeme::BlockComment));
+} while (skipTrivia && (lexeme.type == Lexeme::Comment || lexeme.type == Lexeme::BlockComment || lexeme.type == Lexeme::Whitespace));
```
@JohnnyMorganz (Contributor, Author)

Decided to merge skipComments into a single skipTrivia toggle.

However, I'm also open to splitting them into separate toggles, e.g. skipWhitespace, if that makes more sense.

@JohnnyMorganz (Contributor, Author)

Actually, given the test failures and the invasiveness of the change, I decided to split skipWhitespace out into a separate toggle.

But if the inverse makes more sense, we can change it :)
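
For clarity, a sketch of the split-toggle check (with a minimal stand-in for `Lexeme::Type`; the real flags live on the lexer):

```cpp
// Minimal stand-in; the real lexer uses Lexeme::Type.
enum class TokenType { Comment, BlockComment, Whitespace, Other };

// With separate toggles, comments and whitespace are skipped independently.
bool shouldSkip(TokenType type, bool skipComments, bool skipWhitespace)
{
    if (skipComments && (type == TokenType::Comment || type == TokenType::BlockComment))
        return true;
    return skipWhitespace && type == TokenType::Whitespace;
}
```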

@JohnnyMorganz JohnnyMorganz marked this pull request as draft December 16, 2024 20:25
@vegorov-rbx (Collaborator)

Changes to lexer/parser on the hot path require benchmarking.

If you don't have a large codebase to do it on, we can measure it for you, but it will take time.

@JohnnyMorganz (Contributor, Author)

Some benchmarking after running on a ~1400 KLOC repository: this is 10 runs of `./luau-compile --null .`, reading the parse time it prints (note: the printed time is only given to 2 d.p.).

| Command | Mean ± std.dev [s] | Min [s] | Max [s] |
| --- | --- | --- | --- |
| `./luau-compile.old --null .` | 0.416 ± 0.017 | 0.39 | 0.45 |
| `./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False .` | 0.423 ± 0.021 | 0.39 | 0.46 |
| `./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True .` | 0.447 ± 0.012 | 0.43 | 0.46 |

I also ran it under hyperfine, in case that's a more helpful comparison:

```
$ hyperfine -N --warmup 1 "./luau-compile.old --null ." "./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False ." "./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True ."
Benchmark 1: ./luau-compile.old --null .
  Time (mean ± σ):      2.043 s ±  0.048 s    [User: 1.083 s, System: 0.648 s]
  Range (min … max):    1.947 s …  2.121 s    10 runs

Benchmark 2: ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False .
  Time (mean ± σ):      2.062 s ±  0.066 s    [User: 1.091 s, System: 0.648 s]
  Range (min … max):    1.979 s …  2.172 s    10 runs

Benchmark 3: ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True .
  Time (mean ± σ):      2.077 s ±  0.064 s    [User: 1.114 s, System: 0.648 s]
  Range (min … max):    2.026 s …  2.249 s    10 runs

Summary
  ./luau-compile.old --null . ran
    1.01 ± 0.04 times faster than ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False .
    1.02 ± 0.04 times faster than ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True .
```

The numbers do fluctuate across repeated runs, though. Let me know if I should benchmark differently.

@JohnnyMorganz JohnnyMorganz marked this pull request as ready for review December 17, 2024 21:22
@vegorov-rbx (Collaborator) left a comment

I don't think it makes sense to preserve specific whitespace given the overhead.

We are already able to reproduce files with their original formatting, mostly from location information alone (that's how the Transpiler works).

I guess we lose some info that way, like mixed tabs/spaces and maybe trailing garbage at the end of lines.

Since whitespace lexemes are not even captured in the tree yet, I wonder if you could fill in the blanks in a separate pass based on the location information of the nodes.

Or maybe we don't even need to build that structure: when we output the whitespace, we can go and look at what kind of symbols were used in the original file (even if AST nodes were moved around, it should be reconstructable).
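
As a sketch of that "separate pass" idea (types and names here are hypothetical stand-ins, not Luau's actual API), the exact text between two adjacent nodes can be sliced straight out of the original source using their locations:

```cpp
#include <string>
#include <vector>

// Stand-in for a line/column position (0-indexed), ordered within a file.
struct Position { unsigned line; unsigned column; };

// Recover the exact characters (spaces, tabs, newlines) between the end of
// one node and the start of the next. Assumes `lines` is the original
// source split on '\n' and that `end` precedes `start`.
std::string textBetween(const std::vector<std::string>& lines, Position end, Position start)
{
    std::string out;
    for (unsigned line = end.line; line <= start.line; ++line)
    {
        size_t from = (line == end.line) ? end.column : 0;
        size_t to = (line == start.line) ? start.column : lines[line].size();
        out += lines[line].substr(from, to - from);
        if (line != start.line)
            out += '\n';
    }
    return out;
}
```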

It might also be acceptable in the current form if we optimize whatever is causing the perf overhead (e.g. instead of checking 5 token types, we could pack those enum values together and do `unsigned(token - unimportant_start) <= (unimportant_last - unimportant_start)`, or something else).
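
A sketch of that packed-enum check (the layout is illustrative; the real `Lexeme::Type` values would have to be rearranged so the skippable kinds are contiguous):

```cpp
// Illustrative layout: the "unimportant" kinds occupy one contiguous range.
enum Type { Eof, Name, Number, String, Comment, BlockComment, Whitespace };

constexpr Type kUnimportantStart = Comment;
constexpr Type kUnimportantLast = Whitespace;

// One unsigned comparison replaces the chain of equality checks.
bool isUnimportant(Type token)
{
    return unsigned(token - kUnimportantStart) <= unsigned(kUnimportantLast - kUnimportantStart);
}
```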

@JohnnyMorganz (Contributor, Author)

Thanks, this makes sense; I'll take a look. For context, my original thinking was to make the lexer retain all source information, so that given a stream of just lexer tokens we could recreate the original input. Then, given an AST node and a set of related lexer tokens (which would have to be stored during parsing), we'd be able to recreate exactly how that node was written.

Indeed, this would be important so that we could preserve the exact style of whitespace (spaces vs. tabs, leading/trailing whitespace, etc.).
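
To make that concrete, here's a sketch of the kind of full-fidelity token structure I had in mind (names are hypothetical): each token carries the trivia written before it, so concatenating the stream reproduces the input byte-for-byte:

```cpp
#include <string>
#include <vector>

// Hypothetical structure, not part of this PR.
struct TriviaToken
{
    std::string leadingTrivia; // whitespace/comments appearing before the token
    std::string text;          // the token itself, exactly as written
};

// Round-trip: leading trivia + token text, in order, rebuilds the source.
// (A final EOF token's leadingTrivia would capture trailing whitespace.)
std::string reprint(const std::vector<TriviaToken>& tokens)
{
    std::string out;
    for (const TriviaToken& t : tokens)
        out += t.leadingTrivia + t.text;
    return out;
}
```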

But the general idea of "filling in the gaps" with just location info is interesting. I haven't explored it much; I'll try playing around with it further.

> Or maybe we don't even need to build that structure: when we output the whitespace, we can go and look at what kind of symbols were used in the original file (even if AST nodes were moved around, it should be reconstructable).

This would probably work if the end goal were just to reconstruct the input, but without some sort of structure I imagine the interface for performing code transforms would be much harder to work with, especially if you want to modify the whitespace itself. I can see it being doable, though.
